Source Code: https://github.com/djlofland/DATA622_S2021_Group2/tree/master/FinalProject


Load Data & EDA


Information about the dataset

Before loading our dataset, and discussing our data exploration process, I’ll quickly summarize the dataset that we’ll be using for our machine learning analysis. The dataset is part of UCI’s Cleveland Heart Disease data file, which consists of 303 individuals and various health-related metrics. The dataset we’ll be using includes 14 attributes retrieved from patient tests and can be used for machine learning and classification analysis. Ultimately, these attributes will be examined to determine their effectiveness at predicting whether or not a particular patient has Heart Disease. You can find more information about these 14 attributes below:

Data Dictionary:

  • age: age in years
  • sex: sex (1 = male; 0 = female)
  • cp: chest pain type
    • `Value 1: typical angina
    • Value 2: atypical angina
    • Value 3: non-anginal pain
    • Value 4: asymptomatic
  • trestbps: resting blood pressure (in mm Hg on admission to the hospital)
  • chol: serum cholesterol in mg/dl
  • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • restecg: resting electrocardiographic results
    • Value 0: normal
    • Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
  • thalach: maximum heart rate achieved
  • exang: exercise induced angina (1 = yes; 0 = no)
  • oldpeak: ST depression induced by exercise relative to rest
  • slope: the slope of the peak exercise ST segment
    • Value 1: upsloping
    • Value 2: flat
    • Value 3: downsloping
  • ca: number of major vessels (0-3) colored by fluoroscopy
  • thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
  • target: diagnosis of heart disease (angiographic disease status)
    • Value 0: < 50% diameter narrowing
    • Value 1: > 50% diameter narrowing

Given this context, we are now ready to load our data into R:

df <- read.csv("./data/heart.csv", header = TRUE)
#df <- df %>% 
#  rename(age = "age")

Exploratory Data Analysis

First, we can check the shape of the dataset. It looks like there are 303 rows and 14 columns in our imported data.

dim(df)
## [1] 303  14

Missing values in the dataset

We can also check to see if there is missing data.

plot_missing(df)

As we can see above, there aren’t any missing values across our dataset, which will make it easier for us as we start to prepare our dataset for machine learning analysis. Next, we’ll look at a full summary of our features, including rudimentary distributions of each of our continuous variables:

skimr::skim(df)
Data summary
Name df
Number of rows 303
Number of columns 14
_______________________
Column type frequency:
numeric 14
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 54.37 9.08 29 47.5 55.0 61.0 77.0 ▁▆▇▇▁
sex 0 1 0.68 0.47 0 0.0 1.0 1.0 1.0 ▃▁▁▁▇
cp 0 1 0.97 1.03 0 0.0 1.0 2.0 3.0 ▇▃▁▅▁
trestbps 0 1 131.62 17.54 94 120.0 130.0 140.0 200.0 ▃▇▅▁▁
chol 0 1 246.26 51.83 126 211.0 240.0 274.5 564.0 ▃▇▂▁▁
fbs 0 1 0.15 0.36 0 0.0 0.0 0.0 1.0 ▇▁▁▁▂
restecg 0 1 0.53 0.53 0 0.0 1.0 1.0 2.0 ▇▁▇▁▁
thalach 0 1 149.65 22.91 71 133.5 153.0 166.0 202.0 ▁▂▅▇▂
exang 0 1 0.33 0.47 0 0.0 0.0 1.0 1.0 ▇▁▁▁▃
oldpeak 0 1 1.04 1.16 0 0.0 0.8 1.6 6.2 ▇▂▁▁▁
slope 0 1 1.40 0.62 0 1.0 1.0 2.0 2.0 ▁▁▇▁▇
ca 0 1 0.73 1.02 0 0.0 0.0 1.0 4.0 ▇▃▂▁▁
thal 0 1 2.31 0.61 0 2.0 2.0 3.0 3.0 ▁▁▁▇▆
target 0 1 0.54 0.50 0 0.0 1.0 1.0 1.0 ▇▁▁▁▇
df_viz <- df %>% 
  mutate(sex = if_else(sex == 1, "Male", "Female"),
         fbs = if_else(fbs == 1, ">120", "<=120"),
         exang = if_else(exang == 1, "Yes" ,"No"),
         cp = if_else(cp == 1, "Atypical Angina",
                      if_else(cp == 2, "Non-anginal pain", "Asymptomatic")),
         restecg = if_else(restecg == 0, "Normal",
                           if_else(restecg == 1, "Abnormality", "probable or definite")),
         slope = as.factor(slope),
         ca = as.factor(ca),
         thal = as.factor(thal),
         target = if_else(target == 1, "Has heart disease", "Does not have heart disease")) %>% 
  mutate_if(is.character, as.factor) %>% 
  dplyr::select(target, sex, fbs, exang, cp, restecg, slope, ca, thal, everything())

Interestingly, we can also see that there’s a pretty even split between those that have a presence and absence of heart disease:

# Bar plot for target (Heart disease) 
ggplot(df_viz, aes(x=target, fill=target)) + 
  geom_bar() +
  xlab("Heart Disease") +
  ylab("Count") +
  ggtitle("Analysis of Presence and Absence of Heart Disease") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence"))

prop.table(table(df_viz$target))
## 
## Does not have heart disease           Has heart disease 
##                   0.4554455                   0.5445545

We can see that there is roughly a 50/50 split in this dataset of those that do have heart disease from those that do not have heart disease. This will come in handy later when we set up our machine learning analyses.

Additionally, when looking at the summary distributions, we can see that our patient population skews above 50 years, with an average age of individuals in this dataset at about 54 years old. The oldest individual is 77 years old, and the youngest is 29. We can take a look at the age frequency counts here:

# Counting the frequency of the values of the age
hist(df_viz$age, labels=TRUE, main="Histogram of Age", xlab="Age Class", 
     ylab="Frequency", col="steelblue")

df_viz %>%
  ggplot(aes(x=age, y=chol, size=chol, color=target)) +
  geom_density2d() + 
  geom_point(alpha=0.4) + 
  xlab("Age") +
  ylab("Cholesterol")

Interestingly from the plot above, when we take a look at the relationship between a patient’s age, cholesterol and whether or not they have heart disease, we can see that those with positive diagnoses tend to be clustered between the ages of 40 and 70 years and relatively high cholesterol levels.

We can explore this further by taking a look at these in percentages:

df %>% 
  group_by(target) %>% 
  summarise(mean=mean(trestbps))
## # A tibble: 2 x 2
##   target  mean
##    <int> <dbl>
## 1      0  134.
## 2      1  129.

Feature Plots

For the continuous variables, we can examine the distributions, broken out by the target variable.

cor_heart <- cor(df[,c(4,5,8,10,14)])
cor_heart
##             trestbps         chol      thalach     oldpeak      target
## trestbps  1.00000000  0.123174207 -0.046697728  0.19321647 -0.14493113
## chol      0.12317421  1.000000000 -0.009939839  0.05395192 -0.08523911
## thalach  -0.04669773 -0.009939839  1.000000000 -0.34418695  0.42174093
## oldpeak   0.19321647  0.053951920 -0.344186948  1.00000000 -0.43069600
## target   -0.14493113 -0.085239105  0.421740934 -0.43069600  1.00000000
corrplot(cor_heart, method = "ellipse", type="upper",)

ggcorrplot(cor_heart,lab = T)

ggcorr(cor_heart, label = T, label_round = 2)

[TODO: More explanation of these corr plots]

Dummy Columns

We’ll need to dummy code our categorical variables. This process will create new columns for each value and assign a 0 or 1. Note that dummy encoding typically drops one value which becomes the baseline. So if we have a categorical feature with five unique values, we will have four columns. If all columns are 0, that represents the reference value. This helps reduce the number of necessary columns. With dummy columns in place, we need to remove our original variables from this dataset.

# dummy encode our categorical features
df_dummy <- dummyVars(~ 0 + ., drop2nd=TRUE, data = df_viz)
df_dummy <- data.frame(predict(df_dummy, newdata = df_viz))

Splitting the Dataset into Training and Testing

Our first step will be to separate the data into training and test dataset. This way, we can test the accuracy by using our holdout test set later on. We decided to perform a conventional 80/20 training to testing data split on our original dataframe.

df$sex<-as.factor(df$sex)
levels(df$sex)<-c("Female","Male")

df$cp<-as.factor(df$cp)
levels(df$cp)<-c("typical","atypical","non-anginal","asymptomatic")

df$fbs<-as.factor(df$fbs)
levels(df$fbs)<-c("False","True")

df$restecg<-as.factor(df$restecg)
levels(df$restecg)<-c("normal","stt","hypertrophy")

df$exang<-as.factor(df$exang)
levels(df$exang)<-c("No","Yes")

df$slope<-as.factor(df$slope)
levels(df$slope)<-c("upsloping","flat","downsloping")

df$ca<-as.factor(df$ca)

df$thal<-as.factor(df$thal)

df$target<-as.factor(df$target)
levels(df$target)<-c("No", "Yes")
str(df)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 1 2 2 2 ...
##  $ cp      : Factor w/ 4 levels "typical","atypical",..: 4 3 2 2 1 1 2 2 3 3 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : Factor w/ 2 levels "False","True": 2 1 1 1 1 1 1 1 2 1 ...
##  $ restecg : Factor w/ 3 levels "normal","stt",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 1 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : Factor w/ 3 levels "upsloping","flat",..: 1 1 3 3 3 2 2 3 3 3 ...
##  $ ca      : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ thal    : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
##  $ target  : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
set.seed(123)
df2      <- data.frame(df)
index    <- createDataPartition(df2$target, p=0.8, list = FALSE)
df_train <- df2[index,]
df_test  <- df2[-index,]

Supervised Machine Learning

Random Forest Analysis

Random forest is a Supervised Learning algorithm which uses ensemble learning method for classification and regression. It is a bagging technique and not a boosting technique. The trees in random forests are run in parallel. There is no interaction between these trees while building the trees. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.

set.seed(123)
fit.forest <- randomForest(target ~ ., data = df_train, importance=TRUE, ntree=2000)

# display model details
fit.forest
## 
## Call:
##  randomForest(formula = target ~ ., data = df_train, importance = TRUE,      ntree = 2000) 
##                Type of random forest: classification
##                      Number of trees: 2000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 15.64%
## Confusion matrix:
##     No Yes class.error
## No  90  21   0.1891892
## Yes 17 115   0.1287879
plot(fit.forest)

Red line represents MCR of class not having heart diseases, green line represents MCR of class having heart diseases and black line represents overall MCR or OOB error. Overall error rate is what we are interested in which seems considerably good.

# show which feature were most important
importance(fit.forest)
##                 No         Yes MeanDecreaseAccuracy MeanDecreaseGini
## age       3.274216 11.68285507           11.2940461         9.039992
## sex      12.563559 24.71542708           26.2920990         4.328270
## cp       34.410444 21.32962601           38.4183067        14.942477
## trestbps  3.940274  0.01409797            2.4034416         8.233993
## chol     -6.365641  4.85376733           -0.5700170         8.909245
## fbs      -1.543565  2.25881479            0.7770673         1.007893
## restecg   2.857510  2.83825672            3.8776566         2.364278
## thalach  13.554776 23.72271362           26.1957798        15.123193
## exang    11.131211  8.97793541           13.8987259         5.772066
## oldpeak  22.913261 20.29136972           29.7752681        12.770398
## slope    10.317252  4.69582209           10.2883731         4.141878
## ca       35.506938 44.90702661           51.4059266        17.077924
## thal     27.501678 35.35556739           41.2806038        15.420881
varImpPlot(fit.forest,type=2)

After fitting our model, we can now use it to create predictions on our holdout test set to evaluate its performance.

forest.pred <- predict(fit.forest, newdata = df_test, type = "class")
forest.cm_train <- confusionMatrix(forest.pred, df_test$target)
forest.cm_train
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  21   3
##        Yes  6  30
##                                          
##                Accuracy : 0.85           
##                  95% CI : (0.7343, 0.929)
##     No Information Rate : 0.55           
##     P-Value [Acc > NIR] : 8.068e-07      
##                                          
##                   Kappa : 0.6939         
##                                          
##  Mcnemar's Test P-Value : 0.505          
##                                          
##             Sensitivity : 0.7778         
##             Specificity : 0.9091         
##          Pos Pred Value : 0.8750         
##          Neg Pred Value : 0.8333         
##              Prevalence : 0.4500         
##          Detection Rate : 0.3500         
##    Detection Prevalence : 0.4000         
##       Balanced Accuracy : 0.8434         
##                                          
##        'Positive' Class : No             
## 

The accuracy is 0.85. The test result is not bad. [TODO: Discuss sensitivity and specificity]

[TODO: Need more explanation]

Gradient Boosting

set.seed(123)

# build gradient boosted model
fit.boost <- gbm(target~., df_train,
             distribution = "gaussian",
             n.trees = 2000,
             cv.folds = 5)

# display model details
fit.boost
## gbm(formula = target ~ ., distribution = "gaussian", data = df_train, 
##     n.trees = 2000, cv.folds = 5)
## A gradient boosted model with gaussian loss function.
## 2000 iterations were performed.
## The best cross-validation iteration was 75.
## There were 13 predictors of which 13 had non-zero influence.
sqrt(min(fit.boost$cv.error))
## [1] 0.3479688
summary(fit.boost)

##               var   rel.inf
## chol         chol 15.960107
## thalach   thalach 14.677073
## trestbps trestbps 12.414133
## cp             cp 10.797779
## age           age 10.318393
## oldpeak   oldpeak  9.502784
## ca             ca  9.257851
## thal         thal  7.197664
## slope       slope  2.554913
## exang       exang  2.086447
## fbs           fbs  1.853905
## restecg   restecg  1.778332
## sex           sex  1.600619
plot(fit.boost, i="chol")

plot(fit.boost, i="thalach")

# plot loss function as a result of n trees added to the ensemble
gbm.perf(fit.boost, method = "cv")

## [1] 75
gbm.perf(fit.boost, method = "OOB")
## OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv_folds>1 when calling gbm usually results in improved predictive performance.

## [1] 44
## attr(,"smoother")
## Call:
## loess(formula = object$oobag.improve ~ x, enp.target = min(max(4, 
##     length(x)/10), 50))
## 
## Number of Observations: 2000 
## Equivalent Number of Parameters: 39.99 
## Residual Standard Error: 0.0003482

The plots indicating the optimum number of trees based on the technique we used. The green line indicates the error on the test data, and the blue dotted line points to the optimum number of iterations. We can observe that beyond a certain point (42 iterations for the cross-validation method and 45 for the OOB method), the test data error appears to increase because of overfitting.

According to the cross-validation method, we choose 42 as the number of trees to predict the test set.

boostPre <- predict.gbm(fit.boost, 
                        df_test, 
                        n.trees = 42)
boostPre <- ifelse(boostPre < 1.5, 'No', 'Yes')

boostPre <- as.factor(boostPre)
boost.cm <- confusionMatrix(boostPre, df_test$target)
boost.cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  19   3
##        Yes  8  30
##                                           
##                Accuracy : 0.8167          
##                  95% CI : (0.6956, 0.9048)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 1.344e-05       
##                                           
##                   Kappa : 0.6233          
##                                           
##  Mcnemar's Test P-Value : 0.2278          
##                                           
##             Sensitivity : 0.7037          
##             Specificity : 0.9091          
##          Pos Pred Value : 0.8636          
##          Neg Pred Value : 0.7895          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3167          
##    Detection Prevalence : 0.3667          
##       Balanced Accuracy : 0.8064          
##                                           
##        'Positive' Class : No              
## 

[TODO: Discussion … accuracy, sensitvity, specificity]

Support Vector Machine

In machine learning, Support vector machine (SVM) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. It is mostly used in classification problems. In this algorithm, each data item is plotted as a point in n-dimensional space (where n is number of features), with the value of each feature being the value of a particular coordinate. Then, classification is performed by finding the hyper-plane that best differentiates the two classes.

In addition to performing linear classification, SVMs can efficiently perform a non-linear classification, implicitly mapping their inputs into high-dimensional feature spaces.

set.seed(123)
fit.svm <- svm(target~., 
               data=df_train, 
               kernel="radial", 
               cost=10, 
               scale = FALSE)
fit.svm
## 
## Call:
## svm(formula = target ~ ., data = df_train, kernel = "radial", cost = 10, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  242
svm.pred <- predict(fit.svm, 
                    newdata = df_test)

svm.cm_train <- confusionMatrix(forest.pred, df_test$target)
svm.cm_train
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  21   3
##        Yes  6  30
##                                          
##                Accuracy : 0.85           
##                  95% CI : (0.7343, 0.929)
##     No Information Rate : 0.55           
##     P-Value [Acc > NIR] : 8.068e-07      
##                                          
##                   Kappa : 0.6939         
##                                          
##  Mcnemar's Test P-Value : 0.505          
##                                          
##             Sensitivity : 0.7778         
##             Specificity : 0.9091         
##          Pos Pred Value : 0.8750         
##          Neg Pred Value : 0.8333         
##              Prevalence : 0.4500         
##          Detection Rate : 0.3500         
##    Detection Prevalence : 0.4000         
##       Balanced Accuracy : 0.8434         
##                                          
##        'Positive' Class : No             
## 

[Discuss: …]

svm.tune <- tune(e1071::svm, target~.,
                 data = df_train,
                 ranges=list(cost=c(0.001, 0.01, 0.1, 1, 5, 10, 100),
                 kernel=c("linear", "polynomial", "radial"),
                 gamma=c(0.5, 1, 2, 3, 4)))
summary(svm.tune)
## 
## Parameter tuning of 'e1071::svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost     kernel gamma
##     1 polynomial   0.5
## 
## - best performance: 0.1556667 
## 
## - Detailed performance results:
##      cost     kernel gamma     error dispersion
## 1   1e-03     linear   0.5 0.4565000 0.09035407
## 2   1e-02     linear   0.5 0.2138333 0.05730065
## 3   1e-01     linear   0.5 0.1806667 0.08135124
## 4   1e+00     linear   0.5 0.1766667 0.07442786
## 5   5e+00     linear   0.5 0.1685000 0.06255590
## 6   1e+01     linear   0.5 0.1685000 0.06255590
## 7   1e+02     linear   0.5 0.1770000 0.07846836
## 8   1e-03 polynomial   0.5 0.2633333 0.09427108
## 9   1e-02 polynomial   0.5 0.1891667 0.07329229
## 10  1e-01 polynomial   0.5 0.1763333 0.07644816
## 11  1e+00 polynomial   0.5 0.1556667 0.07546645
## 12  5e+00 polynomial   0.5 0.1556667 0.07546645
## 13  1e+01 polynomial   0.5 0.1556667 0.07546645
## 14  1e+02 polynomial   0.5 0.1556667 0.07546645
## 15  1e-03     radial   0.5 0.4565000 0.09035407
## 16  1e-02     radial   0.5 0.4565000 0.09035407
## 17  1e-01     radial   0.5 0.4565000 0.09035407
## 18  1e+00     radial   0.5 0.2061667 0.07676343
## 19  5e+00     radial   0.5 0.1853333 0.05978253
## 20  1e+01     radial   0.5 0.1853333 0.05978253
## 21  1e+02     radial   0.5 0.1853333 0.05978253
## 22  1e-03     linear   1.0 0.4565000 0.09035407
## 23  1e-02     linear   1.0 0.2138333 0.05730065
## 24  1e-01     linear   1.0 0.1806667 0.08135124
## 25  1e+00     linear   1.0 0.1766667 0.07442786
## 26  5e+00     linear   1.0 0.1685000 0.06255590
## 27  1e+01     linear   1.0 0.1685000 0.06255590
## 28  1e+02     linear   1.0 0.1770000 0.07846836
## 29  1e-03 polynomial   1.0 0.2055000 0.07478384
## 30  1e-02 polynomial   1.0 0.1763333 0.07644816
## 31  1e-01 polynomial   1.0 0.1556667 0.07546645
## 32  1e+00 polynomial   1.0 0.1556667 0.07546645
## 33  5e+00 polynomial   1.0 0.1556667 0.07546645
## 34  1e+01 polynomial   1.0 0.1556667 0.07546645
## 35  1e+02 polynomial   1.0 0.1556667 0.07546645
## 36  1e-03     radial   1.0 0.4565000 0.09035407
## 37  1e-02     radial   1.0 0.4565000 0.09035407
## 38  1e-01     radial   1.0 0.4565000 0.09035407
## 39  1e+00     radial   1.0 0.3455000 0.09543937
## 40  5e+00     radial   1.0 0.3086667 0.05259207
## 41  1e+01     radial   1.0 0.3086667 0.05259207
## 42  1e+02     radial   1.0 0.3086667 0.05259207
## 43  1e-03     linear   2.0 0.4565000 0.09035407
## 44  1e-02     linear   2.0 0.2138333 0.05730065
## 45  1e-01     linear   2.0 0.1806667 0.08135124
## 46  1e+00     linear   2.0 0.1766667 0.07442786
## 47  5e+00     linear   2.0 0.1685000 0.06255590
## 48  1e+01     linear   2.0 0.1685000 0.06255590
## 49  1e+02     linear   2.0 0.1770000 0.07846836
## 50  1e-03 polynomial   2.0 0.1805000 0.07181101
## 51  1e-02 polynomial   2.0 0.1556667 0.07546645
## 52  1e-01 polynomial   2.0 0.1556667 0.07546645
## 53  1e+00 polynomial   2.0 0.1556667 0.07546645
## 54  5e+00 polynomial   2.0 0.1556667 0.07546645
## 55  1e+01 polynomial   2.0 0.1556667 0.07546645
## 56  1e+02 polynomial   2.0 0.1556667 0.07546645
## 57  1e-03     radial   2.0 0.4565000 0.09035407
## 58  1e-02     radial   2.0 0.4565000 0.09035407
## 59  1e-01     radial   2.0 0.4565000 0.09035407
## 60  1e+00     radial   2.0 0.4440000 0.09836377
## 61  5e+00     radial   2.0 0.4398333 0.09040529
## 62  1e+01     radial   2.0 0.4398333 0.09040529
## 63  1e+02     radial   2.0 0.4398333 0.09040529
## 64  1e-03     linear   3.0 0.4565000 0.09035407
## 65  1e-02     linear   3.0 0.2138333 0.05730065
## 66  1e-01     linear   3.0 0.1806667 0.08135124
## 67  1e+00     linear   3.0 0.1766667 0.07442786
## 68  5e+00     linear   3.0 0.1685000 0.06255590
## 69  1e+01     linear   3.0 0.1685000 0.06255590
## 70  1e+02     linear   3.0 0.1770000 0.07846836
## 71  1e-03 polynomial   3.0 0.1556667 0.07546645
## 72  1e-02 polynomial   3.0 0.1556667 0.07546645
## 73  1e-01 polynomial   3.0 0.1556667 0.07546645
## 74  1e+00 polynomial   3.0 0.1556667 0.07546645
## 75  5e+00 polynomial   3.0 0.1556667 0.07546645
## 76  1e+01 polynomial   3.0 0.1556667 0.07546645
## 77  1e+02 polynomial   3.0 0.1556667 0.07546645
## 78  1e-03     radial   3.0 0.4565000 0.09035407
## 79  1e-02     radial   3.0 0.4565000 0.09035407
## 80  1e-01     radial   3.0 0.4565000 0.09035407
## 81  1e+00     radial   3.0 0.4523333 0.09931618
## 82  5e+00     radial   3.0 0.4481667 0.09393802
## 83  1e+01     radial   3.0 0.4481667 0.09393802
## 84  1e+02     radial   3.0 0.4481667 0.09393802
## 85  1e-03     linear   4.0 0.4565000 0.09035407
## 86  1e-02     linear   4.0 0.2138333 0.05730065
## 87  1e-01     linear   4.0 0.1806667 0.08135124
## 88  1e+00     linear   4.0 0.1766667 0.07442786
## 89  5e+00     linear   4.0 0.1685000 0.06255590
## 90  1e+01     linear   4.0 0.1685000 0.06255590
## 91  1e+02     linear   4.0 0.1770000 0.07846836
## 92  1e-03 polynomial   4.0 0.1556667 0.07546645
## 93  1e-02 polynomial   4.0 0.1556667 0.07546645
## 94  1e-01 polynomial   4.0 0.1556667 0.07546645
## 95  1e+00 polynomial   4.0 0.1556667 0.07546645
## 96  5e+00 polynomial   4.0 0.1556667 0.07546645
## 97  1e+01 polynomial   4.0 0.1556667 0.07546645
## 98  1e+02 polynomial   4.0 0.1556667 0.07546645
## 99  1e-03     radial   4.0 0.4565000 0.09035407
## 100 1e-02     radial   4.0 0.4565000 0.09035407
## 101 1e-01     radial   4.0 0.4565000 0.09035407
## 102 1e+00     radial   4.0 0.4565000 0.09035407
## 103 5e+00     radial   4.0 0.4523333 0.09931618
## 104 1e+01     radial   4.0 0.4523333 0.09931618
## 105 1e+02     radial   4.0 0.4523333 0.09931618
bestmod <- svm.tune$best.model
bestmod
## 
## Call:
## best.tune(method = e1071::svm, train.x = target ~ ., data = df_train, 
##     ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100), kernel = c("linear", 
##         "polynomial", "radial"), gamma = c(0.5, 1, 2, 3, 4)))
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##      coef.0:  0 
## 
## Number of Support Vectors:  134
svm.pred2 <- predict(bestmod, newdata = df_test, type = "class")

(svm.cm_train2 <- confusionMatrix(forest.pred, df_test$target))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  21   3
##        Yes  6  30
##                                          
##                Accuracy : 0.85           
##                  95% CI : (0.7343, 0.929)
##     No Information Rate : 0.55           
##     P-Value [Acc > NIR] : 8.068e-07      
##                                          
##                   Kappa : 0.6939         
##                                          
##  Mcnemar's Test P-Value : 0.505          
##                                          
##             Sensitivity : 0.7778         
##             Specificity : 0.9091         
##          Pos Pred Value : 0.8750         
##          Neg Pred Value : 0.8333         
##              Prevalence : 0.4500         
##          Detection Rate : 0.3500         
##    Detection Prevalence : 0.4000         
##       Balanced Accuracy : 0.8434         
##                                          
##        'Positive' Class : No             
## 

According to the Confusion matrix, Our model Accuracy is 0.8333. [TODO: Discuss]

Supervised Model Performance

Random_Forests <- c(forest.cm_train$overall[1], 
                    forest.cm_train$byClass[1], 
                    forest.cm_train$byClass[2], 
                    forest.cm_train$byClass[5], 
                    forest.cm_train$byClass[6], 
                    forest.cm_train$byClass[7])

Gradient_Boosting <- c(boost.cm$overall[1],
                       boost.cm$byClass[1],
                       boost.cm$byClass[2],
                       boost.cm$byClass[5],
                       boost.cm$byClass[6],
                       boost.cm$byClass[7])

SVM <- c(svm.cm_train2$overall[1],
         svm.cm_train2$byClass[1],
         svm.cm_train2$byClass[2],
         svm.cm_train2$byClass[5],
         svm.cm_train2$byClass[6],
         svm.cm_train2$byClass[7])

result <- rbind(Random_Forests, Gradient_Boosting,SVM)

kable(result)
Accuracy Sensitivity Specificity Precision Recall F1
Random_Forests 0.8500000 0.7777778 0.9090909 0.8750000 0.7777778 0.8235294
Gradient_Boosting 0.8166667 0.7037037 0.9090909 0.8636364 0.7037037 0.7755102
SVM 0.8500000 0.7777778 0.9090909 0.8750000 0.7777778 0.8235294

[TODO: Discuss summary]

Unsupervised Machine Learning

Principle Component Analysis

Principal Component Analysis (PCA) is one of the most commonly used unsupervised machine learning algorithms across a variety of applications: exploratory data analysis, dimensionality reduction and information compression. It is a useful technique for exploratory data analysis, allowing us to better visualize the variation present in a dataset with many variables. For our particular use case here, it appears that many of the questionnaire variables fall on Likert scales, which when prepared for analysis are extended to dummy variables. This creates many additional features and can make analysis more difficult due to an increased number of dimensions. Therefore, utilizing PCA to reduce the number of dimensions on our entire dataset and measure the amount of variance explained is beneficial. In order to do this, we’ll use the prcomp() function:

df_dummy.pca <- prcomp(df_dummy, center = TRUE, scale. = TRUE)
summary(df_dummy.pca)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.4969 1.59479 1.50965 1.36202 1.32718 1.27564 1.22439
## Proportion of Variance 0.2011 0.08204 0.07352 0.05984 0.05682 0.05249 0.04836
## Cumulative Proportion  0.2011 0.28316 0.35667 0.41652 0.47334 0.52583 0.57419
##                            PC8    PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.15506 1.1342 1.06882 1.03041 1.02755 0.98902 0.98278
## Proportion of Variance 0.04304 0.0415 0.03685 0.03425 0.03406 0.03155 0.03116
## Cumulative Proportion  0.61723 0.6587 0.69557 0.72982 0.76388 0.79544 0.82659
##                           PC15    PC16    PC17    PC18   PC19    PC20    PC21
## Standard deviation     0.94244 0.93529 0.88443 0.86181 0.8536 0.76794 0.64136
## Proportion of Variance 0.02865 0.02822 0.02523 0.02396 0.0235 0.01902 0.01327
## Cumulative Proportion  0.85524 0.88346 0.90870 0.93265 0.9562 0.97518 0.98845
##                           PC22     PC23      PC24      PC25     PC26      PC27
## Standard deviation     0.59839 5.87e-15 2.144e-15 1.336e-15 1.15e-15 8.766e-16
## Proportion of Variance 0.01155 0.00e+00 0.000e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion  1.00000 1.00e+00 1.000e+00 1.000e+00 1.00e+00 1.000e+00
##                             PC28      PC29     PC30      PC31
## Standard deviation     8.174e-16 5.828e-16 3.23e-16 2.387e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.00e+00 1.000e+00
fviz_eig(df_dummy.pca)

#compute standard deviation of each principal component
std_dev <- df_dummy.pca$sdev
#compute variance
pr_var <- std_dev^2
prop_varex <- pr_var/sum(pr_var)
round(prop_varex[1:10], 2)
##  [1] 0.20 0.08 0.07 0.06 0.06 0.05 0.05 0.04 0.04 0.04

The first principal component in our example therefore explains 20% of the variability, and the second principal component explains 8%. Together, the first ten principal components explain 69% of the variability. And we proceed to plot the PVE and cumulative PVE to provide us our scree plots as we did earlier.

#scree plot
plot(prop_varex, xlab = "Principal Component",
             ylab = "Proportion of Variance Explained",
             type = "b")

#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",
              ylab = "Cumulative Proportion of Variance Explained",
              type = "b")

library("FactoMineR")
library("factoextra")

eig.val <- get_eigenvalue(df_dummy.pca)
eig.val
##          eigenvalue variance.percent cumulative.variance.percent
## Dim.1  6.234534e+00     2.011140e+01                    20.11140
## Dim.2  2.543346e+00     8.204344e+00                    28.31574
## Dim.3  2.279033e+00     7.351718e+00                    35.66746
## Dim.4  1.855109e+00     5.984222e+00                    41.65168
## Dim.5  1.761418e+00     5.681994e+00                    47.33368
## Dim.6  1.627248e+00     5.249187e+00                    52.58286
## Dim.7  1.499129e+00     4.835900e+00                    57.41877
## Dim.8  1.334161e+00     4.303745e+00                    61.72251
## Dim.9  1.286410e+00     4.149711e+00                    65.87222
## Dim.10 1.142386e+00     3.685117e+00                    69.55734
## Dim.11 1.061749e+00     3.424996e+00                    72.98233
## Dim.12 1.055861e+00     3.406003e+00                    76.38834
## Dim.13 9.781525e-01     3.155331e+00                    79.54367
## Dim.14 9.658557e-01     3.115663e+00                    82.65933
## Dim.15 8.881858e-01     2.865115e+00                    85.52445
## Dim.16 8.747641e-01     2.821820e+00                    88.34627
## Dim.17 7.822181e-01     2.523284e+00                    90.86955
## Dim.18 7.427240e-01     2.395884e+00                    93.26543
## Dim.19 7.285658e-01     2.350212e+00                    95.61565
## Dim.20 5.897297e-01     1.902354e+00                    97.51800
## Dim.21 4.113489e-01     1.326932e+00                    98.84493
## Dim.22 3.580708e-01     1.155067e+00                   100.00000
## Dim.23 3.445532e-29     1.111462e-28                   100.00000
## Dim.24 4.596664e-30     1.482795e-29                   100.00000
## Dim.25 1.783936e-30     5.754634e-30                   100.00000
## Dim.26 1.323229e-30     4.268482e-30                   100.00000
## Dim.27 7.684005e-31     2.478711e-30                   100.00000
## Dim.28 6.681098e-31     2.155193e-30                   100.00000
## Dim.29 3.396772e-31     1.095733e-30                   100.00000
## Dim.30 1.043089e-31     3.364804e-31                   100.00000
## Dim.31 5.697784e-32     1.837995e-31                   100.00000
res.pca <- PCA(df_dummy, scale.unit = TRUE, ncp = 5, graph = FALSE)
var <- get_pca_var(res.pca)
corrplot(var$cos2, is.corr=FALSE)

corrplot(var$contrib, is.corr=FALSE)

fviz_contrib(df_dummy.pca, choice = "var", axes = 1, top = 15)

fviz_contrib(df_dummy.pca, choice = "var", axes = 2, top = 15)

fviz_pca_var(df_dummy.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

[TODO: Need more explanation]

Kmeans Analysis

K-means clustering is one of the simplest and popular unsupervised machine learning algorithms that is part of a much deep pool of data techniques and operations in the realm of Data Science. It is the fastest and most efficient algorithm to categorize data points into groups even when very little information is available about data.

fviz_nbclust(df_dummy, kmeans, method = "wss") +
  geom_vline(xintercept = 5, linetype = 2) + # add line for better visualization
  labs(subtitle = "Elbow method") # add subtitle

We can see above, that it is approximately 5. More clusters do not improve within segment variability. Therefore, we’ll perform our initial K-Means model with \(k=5\).

set.seed(42)
km_res <- kmeans(df_dummy, centers = 3, nstart = 20)
summary(km_res)
##              Length Class  Mode   
## cluster      303    -none- numeric
## centers       93    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
sil <- silhouette(km_res$cluster, dist(df_dummy))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   54          0.30
## 2       2  111          0.32
## 3       3  138          0.25

[TODO: Discuss]

fviz_cluster(km_res, df_dummy, ellipse.type = "norm")

To provide some context around the clusters that were generated by our algorithm, we decided to isolate the 5 clusters we created, and subjected them to our df_dummy dataset, grouping and computing the means for each one of our features, to see if any noticeable trends start to arise.

k_means_analysis_df <- df_dummy %>%
  mutate(Cluster = km_res$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")
k_means_analysis_df <- k_means_analysis_df %>% 
  select('Cluster', 'age', 'trestbps', 'chol', 'thalach', 'oldpeak')

kable(k_means_analysis_df) %>% 
  kable_styling(bootstrap_options = "responsive")
Cluster age trestbps chol thalach oldpeak
1 56.27778 133.0926 325.9815 150.7407 1.1666667
2 51.68468 127.6937 197.7748 151.1802 0.9882883
3 55.77536 134.2101 254.0725 147.9855 1.0311594

[Zach Note: just adding these in here since these were a few of the findings I discovered when running this data in python. We can take these out if they aren’t helpful, but thought I’d leave them here in case others find similar trends or would like to expand on these.]

Conclusions

In the end, we worked through a fair bit of analysis on this heart disease dataset. From this, we’ve come up with a few interesting conclusions:

Many of the attributes that were available in the Cleveland Heart Disease dataset, and the individuals in this dataset, matched many of the underlying factors that can lead to heart disease. For instance, many that did indeed have heart disease exhibited higher resting blood pressure, higher cholesterol, and were older, on average than those that did not have heart disease.

Additionally, a large percentage of individuals that exhibited exercise-induced angina (chest pain) had heart disease (roughly 75% of the individuals in the dataset). This is something worth noting, as it may help physicians with diagnoses and helping to identify factors that could lead to this type of disease.

It was also very interesting to run some machine learning algorithms on this dataset. Support Vector Machines (SVM) and Random Forest both provided pretty good classification outputs for predicting whether or not an individual has heart disease. Implementing more of these types of models in the future may be valuable to help physicians make proper decisions and can aid in diagnoses.

Finally, the SVM model that I used for this dataset appeared to be slightly more effective when it was implemented on the test dataset. Although the Random Forest model did extraordinarily well on the training dataset (accuracy around 99%), when extending this model to the testing dataset it seemed to yield a higher number of false negatives and false positives thant he SVM model. If this model were to be used in real world scenarios, it could lead to catastrophic consequences if this were the only decision-making tool – a relatively large percentage of patients could be wrongfully diagnosed with heart disease, and others would be wrongfully told that they do not have heart disease when they actually do.